Abstract

We analyze gene expression time-series data of yeast (S. cerevisiae) measured along two full
cell-cycles [2]. We quantify these data by using q-exponentials, gene expression ranking and a
temporal mean-variance analysis. We construct gene interaction networks based on correlation
coefficients and study the formation of the corresponding giant components and minimum
spanning trees. By coloring genes according to their cell function we find functional clusters in the
correlation networks and functional branches in the associated trees. Our results suggest that a
percolation point of functional clusters can be identified on these gene expression correlation networks.

PACS numbers: 87.10.+e, 89.75.-k, 89.75.Hc 
I. INTRODUCTION



Gene regulatory networks describe the effective interactions between genes. The activity 
of a gene, i.e., its current rate of being transcribed into RNA molecules, can have effects on 
the activity levels of other genes, which will as a result become up- or down- regulated. The 
sum of all up- and down- regulation relations in the whole genome is the gene regulatory 
network. Complete knowledge of the gene network would provide a large part of an
understanding of life. However, this goal is far from being achieved. With present DNA-chip
technology it is possible to measure the transcription rates of an entire genome at a given
point in time, but even these technologies allow only a glimpse of the structure of
the underlying network, because the problem is underdetermined [1]. This situation
has motivated the physics community to statistically characterize the available data and to
(crudely) estimate the structure of the complex networks governing gene dynamics. A step
toward an identification of potential gene interaction networks is to identify and quantify 
meaningful statistical indicators of gene cooperative behavior, which is the main purpose of 
the present work. The idea is that fluctuations of gene expressions over time, e.g., during 
a cell-cycle, can be considered as an output of an interacting gene collective forming a 
structured network. The hope is that a network structure estimate can be inferred from 
statistical properties. At least it should be possible to statistically characterize the types of 
potential candidate networks. 

We consider the time-course expression data $x_i(t)$ for the genome of yeast S. cerevisiae [2].
We determine some statistical indicators of collective dynamical behavior of genes,
such as the q-exponential fit of the cumulative distribution, a ranking distribution and
a mean-variance analysis of differential gene expressions. We construct the
expression-correlation network from time increments of expression data and analyze clusters
and spanning trees. We identify the biological functions of genes using a yeast database [3].
We find that the resulting correlation-based clusters match specific biological functions of
genes in the cell considerably well.
FIG. 1: Histograms (a) and ranking of differential gene expressions (b).

II. SCALE-INVARIANCE IN GENE EXPRESSION LEVELS 

The genome-wide gene expression data in [2] are given in the form of a matrix $x_i(t)$ in
which every row represents one of $N = 6406$ yeast genes and contains the time evolution of
the expression of that gene $i$, while each column corresponds to one measurement time.
Gene expressions are measured at 17 time points, taken every 10 minutes, which covers
approximately two full cell-cycles. We first normalize the gene expressions for each of the
17 measurements separately by dividing each gene expression value by the average gene
expression of the corresponding column. In order to avoid systematic trends in the time
series we use differential expression data defined as $\Delta x_i(t) = x_i(t) - x_i(t-1)$ for
each gene $i$. We determine the cumulative distribution $P(>\Delta x)$ for each time-interval
separately and also for all measurements together (all entries in the matrix). The results
are given in Fig. 1 a. This distribution can be fitted to a q-exponential
form [4],

$$P(>\Delta x) \propto \left[1 - (1-q)\,\frac{\Delta x}{\Delta x_0}\right]^{\frac{1}{1-q}}, \qquad (1)$$

where q represents the non-extensivity parameter. The fitted values of q for the various
time-intervals are in the range 1.52 to 1.63. The average over all times yields q = 1.55,
potentially indicating a non-trivial collective behavior of genes along the cell-cycle. In Fig.
1 b the ranking distribution is shown for genes according to their differential expressions at
a particular time (lower curve) and for all measurements (upper curve). In both cases these
curves exhibit approximate power-law regions, i.e., Zipf's law [5]. The occurrence of Zipf's
law has been found in the ranking of expression data of many other species [6]. The results
FIG. 2: Time fluctuation $\sigma_i$ plotted against time-averaged gene expression $\langle x_i\rangle_t$, for all $N$ genes.

in Fig. 1 b indicate that the characteristic form of the distribution, in particular its slope,
largely persists even when the ranking is performed over all time measurements.
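
As an illustration only, the preprocessing and the quantities discussed in this section can be sketched in a few lines of Python. The function names, the use of plain nested lists for the matrix $x_i(t)$, the scale parameter `dx0` and the prefactor `amplitude` are our assumptions and are not taken from the original analysis; the text also does not state whether signed or absolute increments enter the cumulative distribution, so the sketch accepts any list of values.

```python
import math
from bisect import bisect_right

def normalize_columns(x):
    """Divide every entry by the mean of its column (one column per time point)."""
    n_genes, n_times = len(x), len(x[0])
    col_mean = [sum(row[t] for row in x) / n_genes for t in range(n_times)]
    return [[row[t] / col_mean[t] for t in range(n_times)] for row in x]

def differential(x):
    """Increments Delta x_i(t) = x_i(t) - x_i(t-1) for every gene (row) i."""
    return [[row[t] - row[t - 1] for t in range(1, len(row))] for row in x]

def ccdf(values):
    """Empirical cumulative distribution P(> v), evaluated at each sorted value."""
    s = sorted(values)
    n = len(s)
    return [(v, (n - bisect_right(s, v)) / n) for v in s]

def rank_order(values):
    """Ranking distribution: (rank, value) pairs with rank 1 for the largest value."""
    return [(r + 1, v) for r, v in enumerate(sorted(values, reverse=True))]

def q_exponential(dx, q, dx0, amplitude=1.0):
    """amplitude * [1 - (1 - q) dx / dx0]^(1 / (1 - q)); exp(-dx / dx0) for q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return amplitude * math.exp(-dx / dx0)
    base = 1.0 - (1.0 - q) * dx / dx0
    return amplitude * base ** (1.0 / (1.0 - q)) if base > 0 else 0.0
```

For $q > 1$ the function decays as a power law with exponent $1/(q-1)$ at large arguments, so a fitted value near 1.55 corresponds to a tail exponent of roughly 1.8. An actual fit would require an optimizer, which is deliberately left out of this sketch.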

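Gene expression levels fluctuate during a cell-cycle. For each gene we calculate the temporal
mean, $\langle x_i\rangle_t = \frac{1}{T}\sum_t x_i(t)$, with $T$ the number of time points, and the
temporal fluctuation (standard deviation) $\sigma_i = \sqrt{\langle x_i^2\rangle_t - \langle x_i\rangle_t^2}$,
for all genes $i = 1, \dots, N$. In many dynamical systems a relation between those quantities
is found to be of the form

$$\sigma_i \sim \langle x_i\rangle_t^{\mu}. \qquad (2)$$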

In the case of driven dynamical systems on networks the scaling relation Eq. (2) holds,
with the value of $\mu$ depending on both the network topology and the driving conditions. In
particular, many real networks seem to fall into two 'universality classes' [7]: $\mu = 1$, for
example for scale-free tree graphs and cyclic structures, and $\mu = 1/2$, often found in weakly
driven cyclic graphs. In Fig. 2 the temporal fluctuation $\sigma_i$ of the expression level $x_i(t)$ is
plotted against its temporal mean $\langle x_i\rangle_t$ for each gene. The data yield a slope of $\mu \approx 0.89$,
which suggests a heterogeneous network of genes with highly driven dynamics.
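
The exponent in Eq. (2) can be estimated, for example, as the least-squares slope of $\log\sigma_i$ against $\log\langle x_i\rangle_t$. The following Python sketch shows one way to do this; the fitting procedure and the function names are our assumptions, since the text does not specify how the slope was obtained.

```python
import math

def temporal_mean_and_sigma(series):
    """Return (<x>_t, sigma) with sigma = sqrt(<x^2>_t - <x>_t^2) for one gene."""
    T = len(series)
    mean = sum(series) / T
    mean_sq = sum(v * v for v in series) / T
    return mean, math.sqrt(max(mean_sq - mean * mean, 0.0))

def scaling_exponent(x):
    """Least-squares slope of log(sigma_i) versus log(<x_i>_t) over all genes."""
    pts = []
    for row in x:
        m, s = temporal_mean_and_sigma(row)
        if m > 0 and s > 0:  # the logarithm needs strictly positive values
            pts.append((math.log(m), math.log(s)))
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
    return sxy / sxx
```

As a consistency check, series that differ only by an overall scale factor have $\sigma_i$ proportional to $\langle x_i\rangle_t$ and therefore give a slope of 1.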
FIG. 3: Section of the correlation matrix $W_{ij}$ (a). Size of the giant cluster as a function of
the threshold $W_0$ for equal-time correlations, $\Delta t = 0$ (b). Inset: histogram of $W_{ij}$ for
different time lags, $\Delta t = 0$, 3 and 5.

III. GENE EXPRESSION NETWORKS 
A. Construction of gene networks 

The statistical indicators above do not identify the network topology. However, they
suggest that some collective phenomena occur, which could be thought of as grouped
up- or down-regulations within 'clusters' of genes.

As a first attempt we construct a 'gene expression network' from correlation coefficients 
of temporal differential gene expressions $\Delta x_i(t)$,

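$$W_{ij}(\Delta t) = \frac{\sum_t \left[\Delta x_i(t) - \langle\Delta x_i\rangle\right]\left[\Delta x_j(t+\Delta t) - \langle\Delta x_j\rangle\right]}{\sqrt{\sum_t \left[\Delta x_i(t) - \langle\Delta x_i\rangle\right]^2 \sum_t \left[\Delta x_j(t+\Delta t) - \langle\Delta x_j\rangle\right]^2}}. \qquad (3)$$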

A section of this correlation matrix is shown in Fig. 3 a. The histograms of the correlation
coefficients for all pairs of genes are shown in the inset to Fig. 3 b for several time lags,
$\Delta t = 0$, 3 and 5. These distributions of correlation coefficients clearly exhibit a non-Gaussian
character. To define a network we select a threshold $W_0$. A link is defined to exist between
genes $i$ and $j$ if their correlation exceeds the threshold, $W_{ij} > W_0$. By systematically
decreasing $W_0$ we observe the formation of a giant component, whose size $S_{max}$ is plotted
against the threshold $W_0$ in Fig. 3 b. The condition for the formation of the giant cluster,
$\langle k^2\rangle - 2\langle k\rangle = \sum_k k(k-2)P(k) > 0$, is fulfilled at rather large values of the threshold.
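
The giant-cluster condition quoted above, $\langle k^2\rangle - 2\langle k\rangle = \sum_k k(k-2)P(k) > 0$, where $k$ is the number of links of a node and $P(k)$ the degree distribution, can be evaluated directly from the list of node degrees of a thresholded network. A minimal Python sketch, with a function name of our own choosing:

```python
from collections import Counter

def molloy_reed(degrees):
    """Return <k^2> - 2<k>, computed as sum_k k (k - 2) P(k) from a degree list."""
    n = len(degrees)
    p = {k: c / n for k, c in Counter(degrees).items()}  # empirical P(k)
    return sum(k * (k - 2) * pk for k, pk in p.items())
```

A positive return value indicates that the condition is met. A network in which every node has exactly two links sits at the boundary and returns zero.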



The size of the giant cluster first increases linearly with decreasing $W_0$, until an inflection
point is reached at $W_0 \approx 0.97$ (arrow in Fig. 3 b). The steep increase below this point resembles
a percolation-like behavior in which the network gradually becomes complete in the range
$0.95 < W_0 < 0.97$.
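
The construction described in this subsection (equal-time correlation coefficients, a link whenever $W_{ij} > W_0$, and the size of the largest connected cluster) can be sketched as follows. This is an illustrative Python sketch under our own assumptions: it treats only $\Delta t = 0$, uses a union-find structure to track clusters, and all function names are hypothetical. For the full set of 6406 genes the double loop over pairs would be slow in pure Python.

```python
import math
from collections import Counter

def correlation(a, b):
    """Equal-time correlation coefficient of two increment series of equal length."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = math.sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))
    return num / den if den > 0 else 0.0

def giant_component_size(dx, w0):
    """Size of the largest connected cluster when i and j are linked if W_ij > w0."""
    n = len(dx)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if correlation(dx[i], dx[j]) > w0:
                parent[find(i)] = find(j)  # merge the two clusters
    return max(Counter(find(i) for i in range(n)).values())

# Toy increments for four hypothetical genes; lowering w0 merges clusters.
dx_toy = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, -1, 1, -1]]
sizes = {w0: giant_component_size(dx_toy, w0) for w0 in (0.9, 0.4, -0.5)}
```

Evaluating `giant_component_size` on a grid of thresholds gives a curve of the kind shown in Fig. 3 b, from which an inflection point can be read off.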

B. Clusters and trees 

A particular way to statistically characterize the network topology is to study different
types of connected clusters supported by that network. In Figs. 4 a and c we show all clusters
remaining at thresholds of $W_0 = 0.93$ and $W_0 = 0.90$, respectively. A minimum cluster size
of 10 nodes was chosen. Individual genes are nodes, colored according to their cell function
[3]; the color map is described in the caption of Fig. 4. In Figs. 4 b and d the minimum
spanning trees, which are constructed from the 'distance' $d_{ij} = \sqrt{2(1 - W_{ij})}$, are shown (see
e.g j^|). The maximum spanning trees computed from Wij directly lead to very similar trees 
(not shown), indicating that most of the dynamics is driven by positive correlations. For
threshold values in the range $0.9 < W_0 < 0.97$, a number of smaller clusters is present
apart from the giant cluster. When genes are color-coded according to their biological
functions in the cell [10] (of which a large fraction is known for yeast [3]), genes group
into clusters of like color, suggesting that the gene expression correlations in Eq. (3) capture
functionally similar genes. By varying the threshold $W_0$ in the range $0.9 < W_0 < 0.97$, we
detect the appearance of color-grouping shortly below the inflection point $W_0 \approx 0.97$.
Color-grouping then increases as the threshold is lowered. For comparison, in Figs. 4
c and d we show the situation at $W_0 = 0.9$, where, apart from very small clusters which
are removed from the figure, many new genes have joined the giant component. Its
minimum-distance spanning tree is also shown in Fig. 4 d. Genes with certain functions, in
particular the 'protein synthesis' and 'cellcycle' functions, seem to appear in rather cohesive
subgroups of the network. Genes with predominantly 'metabolism' functions appear more
dispersed over different branches. All these plots are obtained for $\Delta t = 0$. For $\Delta t > 0$ the
functional clusters and branches remain for a while before they gradually disappear in the
noise. This is in agreement with the observed character of the correlation distributions in the
inset to Fig. 3 b, where smaller deviations from a Gaussian distribution are found for $\Delta t > 0$.
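
To illustrate how a minimum spanning tree follows from the distance $d_{ij} = \sqrt{2(1 - W_{ij})}$, the Python sketch below applies Kruskal's algorithm to a small correlation matrix. The choice of Kruskal's algorithm and the function name are our assumptions; the text does not say which algorithm was used, and in the paper the trees are built for the clusters that survive a given threshold rather than for a full matrix.

```python
import math

def minimum_spanning_tree(w):
    """Kruskal's algorithm on the distances d_ij = sqrt(2 (1 - W_ij))."""
    n = len(w)
    edges = sorted(
        (math.sqrt(max(2.0 * (1.0 - w[i][j]), 0.0)), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    tree = []
    for d, i, j in edges:  # shortest distances first
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two separate subtrees, so keep it
            parent[ri] = rj
            tree.append((i, j, d))
    return tree
```

For a connected set of $n$ genes the returned list contains $n - 1$ edges. Coloring the nodes of such a tree by functional category is what produces the functional branches discussed above.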
FIG. 4: Gene expression networks (a and c) for $W_0 = 0.93$ (top), with a minimum cluster size of
10 nodes, and $W_0 = 0.90$ (bottom). Minimum spanning trees (b and d) for the distance measure
$d_{ij}$ and the same values of $W_0$. Genes are colored according to the functions they fulfill in the cell
[3]: yellow-metabolism; pink-energy household; red-cellcycle; blue-transcription; purple-protein
synthesis; white-cellular transport/rescue; black-celltype/development; green-unknown.

IV. CONCLUSIONS 

In conclusion, we made several observations about the statistical nature of gene expression
data which suggest that at least a significant fraction of genes is up/down regulated
in a highly collective manner. Indicators pointing in this direction are: (i) the cumulative
distribution of differential gene expressions can be fitted to q-exponentials, with a non-trivial
$q \approx 1.55$; (ii) an approximate Zipf's law holds in the ranking distribution of differential
expressions; (iii) an almost linear mean-variance dependence with $\mu = 0.89$ signals tightly
driven dynamics; (iv) the correlation matrix element distributions are non-Gaussian and
non-Poisson; and finally, (v) even a crude correlation coefficient network displays the emergence
of clusters, and of functional branches in minimum spanning trees, which seem to be biologically
relevant.